The inherent benefits of the United States university education system attract students from across continents and cultures. A system largely independent of federal government regulation gives universities a free hand to offer diversified courses and research opportunities. Moreover, the employment opportunities during and after education are attractive, especially for students from developing countries. However, affordability is a major question for aspiring students across the continents. This data analysis and machine learning modelling project analyzes historical data from universities in 10 US states, covering basic university data and course type, student demographics, tuition and boarding fees, and pay details from university alumni, and predicts the cost for a prospective out-of-state/international student.
Knowing the cost of a university education is vital for any student planning their studies. The question is even more pressing for an out-of-state student who wants to pursue education at a US university. This paper addresses the question in a systematic way using historical data for out-of-state students. A secondary objective of this analytics project is for universities to understand the cost of education at their own institution, so they can optimize it and keep it at a competitive level that attracts the best talent while keeping their finances healthy.
There are two datasets provided for study and analysis. 1) Universities data: the most recent year of data for all universities, including university ranking and demographics, student demographics, tuition and room/board cost for both in-state and out-of-state students, and alumni data such as pay after graduation and the percentage of students who report making a difference in the world. We take universities from the given 10 states (New York, Massachusetts, Connecticut, California, Washington, Delaware, Alaska, North Dakota, New Jersey, Maryland) for analysis. 2) Tuition income data: historical price and cost data for different income levels. The net_cost field in this dataset gives the actual cost paid by the student after scholarships/awards. The average net cost for the most recent year is taken from this dataset and added to the university dataset for extracting insights and prediction.
Data analysis is done in a structured way, from extracting the dataset through data pre-processing, data cleaning, exploratory data analysis and visualization to machine learning modelling. Data is extracted and transformed by selecting only the 10 states. A data summary is created to get a sense of the given dataset and extract further insights. Insignificant data points are removed. Missing values are handled appropriately, either by a stratified mean imputer or by assigning a value based on domain knowledge of the particular variable. Additionally, the pattern in which the data is missing is considered when handling missing values. Exploratory data analysis applies statistical methods, and data visualization and graphical representation reveal additional inferences such as trends and relationships. Related data points are used in statistical hypothesis tests to assess the plausibility of the assumptions. Finally, regression and classification machine learning models are fitted to predict out_of_state_total. Model accuracy and residuals are documented, and the best-fitting model is recommended for use and implementation.
Finding admission to a reputed university at a relatively low cost is tedious, time-consuming research. This model helps students predict costs and plan their studies, and at the same time a university can use this machine learning model to keep its cost at an optimal level.
#Loading the universities dataset
data_university <-read.csv("universities.csv", header=T, stringsAsFactors = TRUE)
#Selecting only the required locations
data_univ <- filter(data_university,state %in% c("New York", "Massachusetts", "Connecticut","California","Washington","Alaska","North Dakota", "New Jersey", "Maryland","Delaware"))
#Loading the tuition_income dataset
data_tution <-read.csv("tuition_income.csv", header=T, stringsAsFactors = TRUE)
head(data_tution,2)
#Selecting only the required locations with state_code
data_tution <- filter(data_tution,state %in% c("NY", "MA", "CT","CA","WA","AK","ND", "NJ", "MD","DE"))
Note: The tuition dataset is reduced to the required variables and joined here (for Problem 6 and Correlation 2) before performing the train/test split. The most recent year's net_cost from the tuition_income dataset is useful for predicting out_of_state_total: net cost is the average amount actually paid by students after scholarships/awards, so the average net_cost helps students know the actual cost of a university. Net cost could also be categorized by income level to add further predictor variables; however, taking the overall average gives a fair idea of the typical cost incurred at a university. The average net_cost is therefore extracted from the tuition_income dataset and added to the university dataset.
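The join described above can leave NAs when university names do not match exactly. A toy sketch in base R (using `merge` with `all.x = TRUE`, which behaves like dplyr's `left_join`; names and values below are purely illustrative):

```r
# Toy sketch of a left join: names present only on the left side get NA
# for the joined column, which is how unmatched universities end up
# with a missing net_cost.
univ <- data.frame(name = c("Alpha", "Beta", "Gamma"))
cost <- data.frame(name = c("Alpha", "Gamma"), net_cost = c(12000, 15000))

merged <- merge(univ, cost, by = "name", all.x = TRUE)
merged   # Beta's net_cost is NA
```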
head(data_tution,1)
#select the most recent year's data
data_tution_maxdate <- data_tution %>%
select(name, state,total_price,year,campus,net_cost,income_lvl) %>%
group_by(state) %>% filter(year == max(year)) %>% # Take the most recent date per state
group_by()
#take net_Cost mean for all universities
data_tution_cost <-data_tution_maxdate %>%
select(name,total_price,net_cost)%>%
dplyr::filter(net_cost != 0) %>%
group_by(name)%>%
summarise(net_cost = mean(net_cost))
#add net_cost value to university dataset
#left join of university and tuition_cost datasets
data_univ <- left_join(data_univ, data_tution_cost, by = c('name'))
apply(is.na(data_univ),2, sum)
## X name total_enrollment
## 0 0 0
## state n_native n_asian
## 0 0 0
## n_black n_hispanic n_pacific
## 0 0 0
## n_nonresident n_total_minority n_multiracial
## 0 0 0
## n_unknown n_white n_women
## 0 0 0
## state_code type degree_length
## 0 0 0
## room_and_board in_state_tuition in_state_total
## 233 0 0
## out_of_state_tuition out_of_state_total early_career_pay
## 0 0 441
## mid_career_pay make_world_better_percent stem_percent
## 441 444 441
## net_cost
## 64
Details of the datasets: Universities dataset (initial): 2159 observations and 27 variables. Tuition income dataset: 209012 observations and 7 variables. Universities data (after filtering to 10 states): 563 observations and 27 variables.
count(data_univ,"name")
There are 563 universities in the resulting dataset.
Univ_grad_estPay <- data_univ %>% filter (early_career_pay > 50000)
head(Univ_grad_estPay)
There are 94 universities with an estimated early career pay greater than 50,000 dollars per year. It can also be observed that all of these universities are of the public or private institution type.
univ_nonProf <- data_univ %>% filter (type %in% c('Public','Private'))
count(univ_nonProf,"name")
541 universities fall under the category of not-for-profit institutions, with the general assumption that only universities explicitly labelled as for-profit are profit-making.
The universities dataset, a list of US universities from all states, predominantly contains four components:
a) University location, enrollment size and type, including specialization, ownership status and the degree lengths offered
b) Annual enrollment count with student demographic data
c) Fee and cost structure for both in-state and out-of-state students
d) Alumni status, including pay and social impact
The US education system is relatively independent of government regulation compared with other countries. It is highly diverse, and most universities accept students from all over the world. A few universities are run for profit, but private and public universities are largely not-for-profit. In-state students are given tuition fee benefits in all states, and several merit scholarship options are available that reduce the cost burden for students. Students from all continents prefer to study in the US because of a) the diverse course options, b) employment opportunities after study, and c) technology advancement and research opportunities.
Several universities offer both bachelor's and master's degrees, which is not clear from this dataset. We assume that a university with a 2-year course offers only a master's degree; that 100% of seats are available to all students irrespective of demography; that, since for-profit status is explicitly stated, the Private and Public categories are not-for-profit; and that a university with a STEM percentage value offers STEM courses, although this is not explicitly mentioned.
Students choose universities not just based on cost; the other factors include
1) University Ranking
2) State/city of the University
3) Type Private/Public/For Profit
4) Alumni status (pay and social status)
5) STEM and research opportunities
6) Number of international students
Our target feature, out_of_state_total, is most important for out-of-state/international students.
Optimize the out-of-state total cost to attract the best-fit students from across the world, while ensuring the university is sufficiently funded to support staff welfare and research & development.
The room and board variables need to be optimized to achieve this goal. In addition, supporting only purely merit-based students with scholarships helps reduce the net cost.
Data manipulation is the foundation of data analysis and machine learning. It includes cleaning the data; making it readable by transforming variables or calculating additional ones; removing duplicates and outliers; arranging rows in order; removing less significant variables; summarizing and visualizing to draw inferences; applying statistical measures such as mean and median; and renaming variables so the data makes more sense.
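The manipulation steps listed above can be sketched in base R on a toy data frame (illustrative values only, not project data):

```r
# Tiny base-R sketch of common manipulation steps: deduplicate, rename,
# order, summarize.
df <- data.frame(univ = c("A", "A", "B"), cost = c(100, 100, 250))

df <- unique(df)                               # remove duplicate rows
names(df)[names(df) == "cost"] <- "total_cost" # rename for readability
df <- df[order(df$total_cost), ]               # arrange in order
mean(df$total_cost)                            # summary statistic: 175
```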
Here the total cost of a university education in the US is predicted for an out-of-state/international student. Regression is the more suitable class of algorithm, since we are predicting the continuous total cost for out-of-state students from the features available in the dataset.
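A minimal regression sketch of the idea (this is not the project's final model; the data below are synthetic, and the variable names merely mirror the real dataset for illustration):

```r
# Fit a linear model to predict a cost-like response from two predictors,
# then predict for a hypothetical new university.
set.seed(1)
n <- 200
in_state_tuition <- runif(n, 2000, 60000)
room_and_board   <- runif(n, 5000, 20000)
out_of_state_total <- 1.2 * in_state_tuition + room_and_board + rnorm(n, sd = 1000)

fit <- lm(out_of_state_total ~ in_state_tuition + room_and_board)
coef(fit)   # recovers roughly 1.2 and 1.0 on this synthetic data
predict(fit, data.frame(in_state_tuition = 30000, room_and_board = 12000))
```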
#Correcting the state_code for Washington:
data_univ <- within(data_univ, state_code[state_code == 'OH' & state == 'Washington'] <- 'WA')
#Performing summary statistics to analyse the categorical features:
skim(data_univ)
| Name | data_univ |
| Number of rows | 563 |
| Number of columns | 28 |
| _______________________ | |
| Column type frequency: | |
| factor | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 0 | 1 | FALSE | 562 | Mid: 2, Aca: 1, Ade: 1, Adi: 1 |
| state | 0 | 1 | FALSE | 10 | Cal: 194, New: 121, Mas: 78, Was: 45 |
| state_code | 0 | 1 | FALSE | 10 | CA: 194, NY: 121, MA: 77, WA: 45 |
| type | 0 | 1 | FALSE | 3 | Pub: 271, Pri: 270, For: 22 |
| degree_length | 0 | 1 | FALSE | 2 | 4 Y: 325, 2 Y: 238 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1.00 | 1058.44 | 631.69 | 4.00 | 484.50 | 1075.00 | 1597.50 | 2157.0 | ▇▇▇▇▆ |
| total_enrollment | 0 | 1.00 | 6663.93 | 7522.65 | 20.00 | 1556.50 | 4160.00 | 9067.00 | 51237.0 | ▇▂▁▁▁ |
| n_native | 0 | 1.00 | 30.55 | 54.05 | 0.00 | 4.00 | 14.00 | 35.00 | 549.0 | ▇▁▁▁▁ |
| n_asian | 0 | 1.00 | 615.77 | 1227.18 | 0.00 | 42.50 | 167.00 | 582.50 | 10381.0 | ▇▁▁▁▁ |
| n_black | 0 | 1.00 | 617.98 | 1240.94 | 0.00 | 85.00 | 238.00 | 666.50 | 18727.0 | ▇▁▁▁▁ |
| n_hispanic | 0 | 1.00 | 1522.56 | 2708.62 | 0.00 | 127.00 | 372.00 | 1536.50 | 24235.0 | ▇▁▁▁▁ |
| n_pacific | 0 | 1.00 | 22.58 | 42.30 | 0.00 | 1.00 | 6.00 | 26.00 | 405.0 | ▇▁▁▁▁ |
| n_nonresident | 0 | 1.00 | 367.53 | 1027.45 | 0.00 | 13.00 | 72.00 | 230.00 | 11495.0 | ▇▁▁▁▁ |
| n_total_minority | 0 | 1.00 | 3042.35 | 4397.38 | 0.00 | 379.00 | 1123.00 | 3670.50 | 29427.0 | ▇▁▁▁▁ |
| n_multiracial | 0 | 1.00 | 232.91 | 336.50 | 0.00 | 31.50 | 102.00 | 319.00 | 2523.0 | ▇▁▁▁▁ |
| n_unknown | 0 | 1.00 | 356.51 | 566.15 | 0.00 | 48.00 | 162.00 | 427.50 | 5111.0 | ▇▁▁▁▁ |
| n_white | 0 | 1.00 | 2897.53 | 3183.93 | 0.00 | 676.50 | 1920.00 | 3927.00 | 23596.0 | ▇▂▁▁▁ |
| n_women | 0 | 1.00 | 3678.43 | 4109.31 | 0.00 | 878.00 | 2367.00 | 5119.50 | 36116.0 | ▇▁▁▁▁ |
| room_and_board | 233 | 0.59 | 12110.23 | 3764.53 | 30.00 | 10216.50 | 12835.00 | 14814.50 | 21300.0 | ▁▂▅▇▁ |
| in_state_tuition | 0 | 1.00 | 19908.72 | 18309.31 | 1154.00 | 4506.50 | 11500.00 | 35754.00 | 59985.0 | ▇▂▂▂▂ |
| in_state_total | 0 | 1.00 | 27007.07 | 23814.61 | 1154.00 | 4920.00 | 18160.00 | 49146.50 | 75003.0 | ▇▃▂▂▂ |
| out_of_state_tuition | 0 | 1.00 | 23207.69 | 16018.95 | 3460.00 | 9546.00 | 16241.00 | 36158.00 | 59985.0 | ▇▂▂▂▂ |
| out_of_state_total | 0 | 1.00 | 30306.05 | 21682.63 | 3460.00 | 10045.00 | 21286.00 | 49356.00 | 75003.0 | ▇▂▂▂▂ |
| early_career_pay | 441 | 0.22 | 57638.52 | 9526.17 | 39900.00 | 50325.00 | 55100.00 | 63075.00 | 88800.0 | ▃▇▅▁▁ |
| mid_career_pay | 441 | 0.22 | 104952.46 | 18503.68 | 68600.00 | 90000.00 | 101500.00 | 115225.00 | 158200.0 | ▂▇▅▂▁ |
| make_world_better_percent | 444 | 0.21 | 49.73 | 7.87 | 35.00 | 44.00 | 49.00 | 54.00 | 82.0 | ▅▇▃▁▁ |
| stem_percent | 441 | 0.22 | 21.38 | 20.45 | 0.00 | 9.00 | 15.50 | 26.00 | 100.0 | ▇▃▁▁▁ |
| net_cost | 64 | 0.89 | 18172.69 | 10522.02 | 2077.55 | 9180.49 | 15753.05 | 26361.44 | 59655.5 | ▇▅▃▁▁ |
The skim function provides an overview of the tidy dataset we have for analysis. The summary clearly shows the column types and their frequencies: there are two column types, factor and numeric. Each variable's characteristics are described, with the data distribution shown as a small histogram per variable. There are 5 categorical variables in the dataset: name (562 levels), state (10), state_code (10), type (3) and degree_length (2).
data_univ %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density()
## Warning: Removed 2064 rows containing non-finite values (stat_density).
From the above plots we can conclude that almost all of the variables are not normally distributed; they are either right skewed or left skewed. A positively skewed (right-skewed) distribution means the most frequent values are low, so Mode < Median < Mean. A negatively skewed (left-skewed) distribution means the most frequent values are high, so Mode > Median > Mean. Normally distributed data has a skewness coefficient of zero.
#checking skewness of the target variable
skewness(data_univ$out_of_state_total, na.rm = TRUE)
## [1] 0.5279084
Here we check the skewness of the output variable, out_of_state_total. The skewness coefficient is 0.5279, which means the distribution differs from a normal distribution; the larger the absolute skewness, the more the distribution differs from normality.
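One common way to reduce right skew in a cost-like variable is a log transform. A package-free illustration on simulated data (an assumption about a possible preprocessing step, not something applied in this report; the moment-based skewness is computed by hand instead of the package `skewness()` used above):

```r
# Right-skewed data has positive skewness; log-transforming it brings
# the skewness coefficient much closer to zero.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)
x <- rlnorm(1000, meanlog = 10, sdlog = 0.5)  # right-skewed, cost-like values
skew(x)        # clearly positive
skew(log(x))   # much closer to zero
```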
#Normalisation
DataExplorer::plot_qq(data_univ[1:10])
DataExplorer::plot_qq(data_univ[11:20])
## Warning: Removed 233 rows containing non-finite values (stat_qq).
## Warning: Removed 233 rows containing non-finite values (stat_qq_line).
DataExplorer::plot_qq(data_univ[21:28])
## Warning: Removed 1831 rows containing non-finite values (stat_qq).
## Warning: Removed 1831 rows containing non-finite values (stat_qq_line).
The QQ plots illustrate the normality of the data for each variable. We can see that almost all variables are non-normal, which hints that normalizing the data could be one approach to feature scaling.
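Two common feature-scaling options can be sketched in base R (the vector below is illustrative, chosen to resemble cost magnitudes rather than taken from the data):

```r
# Standardisation (z-score) vs min-max rescaling of a numeric feature.
x <- c(3460, 10045, 21286, 49356, 75003)

z  <- as.numeric(scale(x))              # standardise: mean 0, sd 1
mm <- (x - min(x)) / (max(x) - min(x))  # min-max: rescale to [0, 1]

round(z, 2)
round(mm, 2)
```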
#Train test split
set.seed(101)
split.index <- createDataPartition(data_univ$degree_length, p = .7, list = FALSE)
train_data <- data_univ[ split.index,]
test_data <- data_univ[-split.index,]
dim(train_data)
## [1] 395 28
dim(test_data)
## [1] 168 28
# check the proportions
prop.table(table(train_data$degree_length))
##
## 2 Year 4 Year
## 0.4227848 0.5772152
prop.table(table(test_data$degree_length))
##
## 2 Year 4 Year
## 0.422619 0.577381
#Mean of the train dataset - out_of_state_total
print(mean(train_data$out_of_state_total))
## [1] 30179.43
#Mean of the test dataset - out_of_state_total
print(mean(test_data$out_of_state_total))
## [1] 30603.76
#Standard deviation of the train dataset - out_of_state_total
print(sd(train_data$out_of_state_total))
## [1] 21469.93
#Standard deviation of the test dataset - out_of_state_total
print(sd(test_data$out_of_state_total))
## [1] 22236.89
### Data cleaning
Note 1: Data cleaning (Problem 7) is performed here for EDA, better interpretation and visualization.
Note 2: Justification for the missing-value handling method is given in Problem 7.
#Handling missing values
#Checking for the NA values
apply(is.na(train_data),2, sum)
## X name total_enrollment
## 0 0 0
## state n_native n_asian
## 0 0 0
## n_black n_hispanic n_pacific
## 0 0 0
## n_nonresident n_total_minority n_multiracial
## 0 0 0
## n_unknown n_white n_women
## 0 0 0
## state_code type degree_length
## 0 0 0
## room_and_board in_state_tuition in_state_total
## 168 0 0
## out_of_state_tuition out_of_state_total early_career_pay
## 0 0 311
## mid_career_pay make_world_better_percent stem_percent
## 311 314 311
## net_cost
## 45
#Mean imputer function:
mean_imputer <- function(x)
{x = round(replace(x, is.na(x) , mean(x, na.rm = TRUE)), digits=0)}
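As a quick sanity check of the imputer's behaviour (toy values, not project data):

```r
# NAs are replaced by the rounded mean of the non-missing values.
mean_imputer <- function(x)
  round(replace(x, is.na(x), mean(x, na.rm = TRUE)), digits = 0)

mean_imputer(c(10, NA, 20))   # the NA becomes 15
```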
##Handling missing value for room_and_board by stratified mean imputer with respect to state.
train_data <- train_data %>% group_by(state_code) %>%
mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
group_by() #removing groupby
#There are NA values in the newly created net_cost variable (45 in the training set) due to university name mismatches during the join.
#Mean imputer - net_cost
train_data <- train_data %>% mutate_at(28, mean_imputer)
# Handle NA for early_career_pay, mid_career_pay, make_world_better_percent, stem_percent
train_data[is.na(train_data)] <- 0
#Checking Near Zero Variance for the train dataset
train_data_nzv <- nearZeroVar(train_data)
train_data_nzv
## [1] 26
It can be observed that only one feature has near-zero variance. Since this is a single feature, its effect can be treated as negligible.
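What "near-zero variance" means can be illustrated in base R (caret's `nearZeroVar` flags features like this via a high frequency ratio and a low percentage of unique values; the exact thresholds are an assumption here):

```r
# A nearly constant feature: almost all values identical, so it carries
# little information for a model.
x <- c(rep(0, 98), 1, 1)                        # 98% identical values

freq <- sort(table(x), decreasing = TRUE)
freq_ratio <- unname(freq[1] / freq[2])         # most common vs next: 49
pct_unique <- 100 * length(unique(x)) / length(x)  # 2% distinct values
```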
#Data Cleaning for data_univ as this dataset will be used for data visualization and EDA.
#Treating missing value room_and_board with stratified mean imputer
data_univ <- data_univ %>% group_by(state_code) %>%
mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
group_by() #removing groupby
#Treating missing value Net_cost with mean imputer
data_univ <- data_univ %>% mutate_at(28, mean_imputer)
#Replacing missing value cost variables with 0
data_univ[is.na(data_univ)] <- 0
apply(is.na(data_univ),2, sum)
## X name total_enrollment
## 0 0 0
## state n_native n_asian
## 0 0 0
## n_black n_hispanic n_pacific
## 0 0 0
## n_nonresident n_total_minority n_multiracial
## 0 0 0
## n_unknown n_white n_women
## 0 0 0
## state_code type degree_length
## 0 0 0
## room_and_board in_state_tuition in_state_total
## 0 0 0
## out_of_state_tuition out_of_state_total early_career_pay
## 0 0 0
## mid_career_pay make_world_better_percent stem_percent
## 0 0 0
## net_cost
## 0
#Selecting required variables for visualization purpose
data_univ_viz <- data_univ %>%
select( state
, state_code , name , total_enrollment , n_native , n_asian , n_black , n_hispanic , n_pacific , n_nonresident , n_total_minority
, n_multiracial , n_unknown , n_white , n_women ,room_and_board ,in_state_tuition ,in_state_total ,out_of_state_tuition
,out_of_state_total ,net_cost ,early_career_pay ,mid_career_pay ,stem_percent ,make_world_better_percent)
#Update short name and State Code for the universities
data_univ_viz <- data_univ_viz %>%
mutate(univ_short_name= paste(str_extract(data_univ_viz$name,"(\\w+)"),data_univ_viz$state_code) )
Observation and inferences are given below each visualization.
#Visualization 1 : To find outliers in the target variable
ggplot(data_univ_viz) +
aes(x = "", y = out_of_state_total) +
geom_boxplot(fill = "#0c4c8a") +
ggtitle("Distribution plot for out_of_state_total") +
theme_minimal()
The box plot describes the range of the target variable. We can observe that there are no outliers, and the values lie approximately between 3,500 and 75,000.
#Visualization 2
data_univ_usmap <-data_univ_viz %>%
select(state,state_code,out_of_state_total)%>%
group_by(state_code)%>%
summarise(out_of_state_total = round(mean(out_of_state_total), 2))
#US Map with average out_of_state_total
plot_ly(
type = "choropleth",
locations = data_univ_usmap$state_code,
locationmode = "USA-states",
z = data_univ_usmap$out_of_state_total) %>%
layout(geo = list(scope = "usa")) %>%
layout(title = 'Given 10 US states with their out_of_state_total', plot_bgcolor = "#e5ecf6")
This choropleth shows the 10 given US states shaded by their average out_of_state_total.
#Visualization 3
#Scatterplot - out_of_state_total and total_enrollment with respect to each state
ggplot(data = data_univ) +
geom_point(mapping = aes(x = total_enrollment, y = out_of_state_total, color = state)) +
ggtitle("Scatterplot - Total enrollment with out_of_state_total with respect to each state")
This scatterplot is colored by state. We can see that in most states, universities with lower enrollment have a high out_of_state_total. There are a few exceptions, such as New York, North Dakota and California, where higher total_enrollment coincides with higher out_of_state_total.
#Visualization 4
#scatterplot3d:
attach(data_univ_viz)
scatterplot3d(make_world_better_percent, stem_percent, out_of_state_total,
pch=16,
highlight.3d=TRUE,
type="h",
main="3D Scatterplot - STEM & Make world better",
phi = 0,
ticktype = "detailed")
A 3D scatter plot of three numeric variables: make_world_better_percent and stem_percent on the x and y axes, with the target variable out_of_state_total on the z axis. The data points rise together with out_of_state_total, which indicates a positive correlation among them. We can also observe that make_world_better_percent increases as stem_percent increases, suggesting that STEM graduates tend to report making the world a better place.
#Visualization 5
#Select top 5 and bottom 5 universities by out_of_state_total
data_high_Low <- data_univ_viz %>%
select(univ_short_name,out_of_state_total)
#Top 5 and low 5 univ with wrt cost
data_univ_viz_high <- data_high_Low %>% top_n(5)
## Selecting by out_of_state_total
data_univ_viz_low <- data_high_Low %>% top_n(-5)
## Selecting by out_of_state_total
#Combine top and bottom universities
data_univ_viz_high_low<-bind_rows(data_univ_viz_high, data_univ_viz_low)
#Color for bar chart
coul <- brewer.pal(5, "Set2")
barplot(data_univ_viz_high_low$out_of_state_total,
main = "Top 5 and low 5 Universities - Out_of_state_total_fee",
ylab = "",
xlab = "Universities",
names.arg = data_univ_viz_high_low$univ_short_name,
col = coul,
width=c(1),
density=c(70),
las=1,
horiz = FALSE
)
The bar plot shows the most expensive and the least expensive universities within the given states.
#Visualization 6
#Bottom 5 - least expensive universities with selected demographic counts
data_univ_viz_low<- data_univ_viz[order(data_univ_viz$out_of_state_total),][1:5,]
data_univ_viz_low_demo <- data_univ_viz_low %>%
select(univ_short_name,n_native, n_asian, n_black,n_women)
#convert to Pivot longer for plot
data_univ_viz_low_PL <- data_univ_viz_low_demo %>%
pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
ggplot_all_param <- ggplot(data=data_univ_viz_low_PL, aes(x=univ_short_name, y=Value, fill=Index)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_manual(values = coul)+
theme_minimal()
ggplot_all_param + ggtitle("Low expensive Universities with Demographics features")
This bar graph shows the least expensive universities and how the selected demographic features are distributed among them. We can observe a wide range in the number of women enrolled.
#Visualization 7
#Top10 High Expensive Universities - Total enrollment with all demographic features
data_univ_viz_high<- data_univ_viz[order(-data_univ_viz$out_of_state_total),][1:10,]
data_univ_viz_high_demo <- data_univ_viz_high %>%
select(univ_short_name,total_enrollment,n_women,n_total_minority,n_native,n_asian,n_hispanic,n_pacific,n_nonresident,n_black,n_white,
n_multiracial,n_unknown)
#convert to Pivot longer for plot
data_univ_viz_high_PL <- data_univ_viz_high_demo %>%
pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
#Horizontal Plot
ggplot(data_univ_viz_high_PL[order(data_univ_viz_high_PL$Index, decreasing = T),],
aes(Value, y=reorder(univ_short_name, -Value), label= round(Value,2), fill=factor(Index, levels=c("total_enrollment","n_women","n_total_minority","n_native","n_asian","n_hispanic","n_pacific","n_nonresident","n_white","n_black","n_multiracial","n_unknown")))) +
geom_bar(stat = "identity") +
scale_fill_discrete(name = "Category", labels = c("total_enrollment","n_women","n_total_minority","n_native","n_asian","n_hispanic","n_pacific","n_nonresident","n_white","n_black","n_multiracial","n_unknown"))+
scale_x_continuous(breaks=c(20000,40000,60000,80000,100000,120000,140000,160000,180000)) +
ylab("Universities") +
ggtitle("Top10 Expensive Universities with Demographics features")
#Note: Applied log for all variables to fit them into scale for the bar chart visualization.
This bar graph shows the distribution of women and minority groups at the expensive universities. These two groups play a major part in the distribution, and a major contribution comes from non-resident students. There is not much difference between the numbers of white and black students: both communities occupy the expensive universities in almost equal numbers.
#Visualization 8
#Plotting categorical variable - Type
plot(data_univ$type,col = c("red", "blue","green"))
This plot shows that public and private universities are by far the most numerous.
#Visualization 9
#Plotting categorical variable - Degree Length
plot(data_univ$degree_length, col = c("pink", "purple"))
This plot shows that 4-year degrees are more common than 2-year degrees.
With reference to the visualizations, non-residents, women and minority groups contribute heavily to the expensive universities, while the number of native students is very low; Asian students account for roughly 5 to 10%, and Hispanic students are close to, or slightly below, the Asian numbers.
We can observe that specific groups are more prevalent in the less expensive universities. This might be because: 1. affording a high tuition fee can be a challenge; 2. the cost of living in the city (room and board) is what they can afford.
#Visualization 10
#Expensive universities - fee components and actual cost
data_univ_viz_high<- data_univ_viz[order(-data_univ_viz$out_of_state_total),][1:10,]
data_univ_viz_high_cost <- data_univ_viz_high %>%
select(univ_short_name,room_and_board,in_state_tuition,out_of_state_tuition,net_cost)
#convert to Pivot longer for plot
data_univ_viz_high_PL <- data_univ_viz_high_cost %>%
pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
ggplot_all_param <- ggplot(data=data_univ_viz_high_PL, aes(x=univ_short_name, y=Value, fill=Index)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_manual(values = coul)+
theme_minimal()
ggplot_all_param + ggtitle("Expensive Universities - 4 Cost factors")
This graph shows the four cost factors at the expensive universities, the main contributing factors to university selection. We can see that in_state_tuition and out_of_state_tuition are far more expensive than the other cost components.
#Plotting a correlation map
data_univ_corr <- data_univ[c(3,5:15,19:22,24:28,23)]
data_univ_corr <- cor(data_univ_corr)
corrplot(data_univ_corr, method = "circle")
It can be observed that the following variables have a strong correlation with the target variable: 1) in_state_tuition, 2) in_state_total, 3) out_of_state_tuition.
#Correlation and scatterplot
data_univ_corr2 <- data_univ[c(13:15,19:23)]
ggpairs(data_univ_corr2,
columns = c("in_state_tuition", "in_state_total", "out_of_state_tuition","out_of_state_total"),
upper = list(continuous = wrap("cor", size = 10)),
lower = list(continuous = "smooth"))
From the correlation and scatter plots we can see that all these variables follow the same trend. This might indicate spurious correlation between the variables; one possible reason is that the residuals exhibit autocorrelation.
From the correlation plot, three features that correlate with make_world_better_percent are early_career_pay, mid_career_pay and stem_percent.
Hypothesis testing is a statistical test performed to prove an idea or an assumption by using sample data.
Hypothesis Statement:
H0 (Null hypothesis): there is no monotonic relationship between the two variables
H1 (Alternate hypothesis): there is a monotonic relationship between the two variables
H0 and H1 are mutually exclusive, and only one can be true; however, one of the two hypotheses is always true.
Shapiro Normality test:
It is necessary to check the normality of the data before a hypothesis test is performed. Here we have chosen the Shapiro-Wilk normality test, as it is widely recommended and more powerful than alternative tests. The Shapiro-Wilk test is based on the correlation between the data and their normal scores.
#Shapiro test - to check the normality of the data before performing hypothesis testing:
shapiro.test(data_univ$early_career_pay)
##
## Shapiro-Wilk normality test
##
## data: data_univ$early_career_pay
## W = 0.54847, p-value < 2.2e-16
shapiro.test(data_univ$mid_career_pay)
##
## Shapiro-Wilk normality test
##
## data: data_univ$mid_career_pay
## W = 0.54951, p-value < 2.2e-16
shapiro.test(data_univ$stem_percent)
##
## Shapiro-Wilk normality test
##
## data: data_univ$stem_percent
## W = 0.41277, p-value < 2.2e-16
It is good practice to test for normality before proceeding with hypothesis testing. The Shapiro-Wilk test tells us whether a variable is normally distributed. After performing the test, we observe that the p-value for all three variables is below 2.2e-16, which is less than the significance level of 0.05; the variables are therefore non-normally distributed. Since the variables follow a non-normal distribution, we use the Spearman correlation test, which does not assume normality of the variables.
H0 (Null hypothesis): early_career_pay has no monotonic relationship with stem_percent
H1 (Alternate hypothesis): early_career_pay has a monotonic relationship with stem_percent
hyp_test_1 <- cor.test(data_univ$early_career_pay, data_univ$stem_percent,
method = "spearman", exact = FALSE)
hyp_test_1
##
## Spearman's rank correlation rho
##
## data: data_univ$early_career_pay and data_univ$stem_percent
## S = 1080122, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9636838
ggscatter(data_univ, x = "stem_percent", y = "early_career_pay",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "spearman",
xlab = "stem_percent", ylab = "early_career_pay")
## `geom_smooth()` using formula 'y ~ x'
The p-value here is below 2.2e-16, which is less than the significance level of 0.05, so we reject H0 in favour of H1. From this it can be concluded that early_career_pay is strongly associated with stem_percent (rho ≈ 0.96).
H0 (Null hypothesis) : stem_percent has no monotonic relationship with mid_career_pay
H1 (Alternate hypothesis) : stem_percent has a monotonic relationship with mid_career_pay
hyp_test_2 <- cor.test(data_univ$mid_career_pay, data_univ$stem_percent,
method = "spearman", exact = FALSE)
hyp_test_2
##
## Spearman's rank correlation rho
##
## data: data_univ$mid_career_pay and data_univ$stem_percent
## S = 1071238, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9639825
The p-value here is below 2.2e-16, which is less than the significance level of 0.05, so we reject H0 in favour of H1. From this it can be concluded that stem_percent has a strong positive monotonic relationship with mid_career_pay (rho ≈ 0.96).
H0 (Null hypothesis) : make_world_better_percent is not associated with stem_percent
H1 (Alternate hypothesis) : make_world_better_percent is associated with stem_percent
hyp_test_3 <- cor.test(data_univ$stem_percent, data_univ$make_world_better_percent,
method = "spearman", exact = FALSE)
hyp_test_3
##
## Spearman's rank correlation rho
##
## data: data_univ$stem_percent and data_univ$make_world_better_percent
## S = 1631798, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9451352
The p-value here is below 2.2e-16, which is less than the significance level of 0.05, so we reject H0 in favour of H1. From this it can be concluded that make_world_better_percent is positively associated with stem_percent (rho ≈ 0.95).
H0 (Null hypothesis) : make_world_better_percent is not associated with early_career_pay
H1 (Alternate hypothesis) : make_world_better_percent is associated with early_career_pay
hyp_test_4 <- cor.test(data_univ$early_career_pay, data_univ$make_world_better_percent,
method = "spearman", exact = FALSE)
hyp_test_4
##
## Spearman's rank correlation rho
##
## data: data_univ$early_career_pay and data_univ$make_world_better_percent
## S = 1067610, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9641045
The p-value here is below 2.2e-16, which is less than the significance level of 0.05, so we reject H0 in favour of H1. From this it can be concluded that make_world_better_percent is positively associated with early_career_pay (rho ≈ 0.96).
Hypothesis 3 appears the most plausible. Domain knowledge suggests that students who engage more in STEM fields are the ones who contribute most to making the world a better place. For instance, engineering and technology are rapidly evolving fields that contribute heavily to human advancement.
Employment or placement data for the alumni, had it been provided in the dataset, might have offered further support for this hypothesis.
Correlation establishes a relationship between variables: it measures the extent to which two variables are related to each other. Correlation can be measured using the tests below:
Pearson correlation coefficients
Spearman correlation coefficients
Kendall correlation
Pearson correlation coefficient:
- It is a parametric test.
- It can be used to find the correlation between two quantitative variables in a population.
- It evaluates the linear relationship between two continuous variables.
- To perform Pearson correlation, the variables have to be normally distributed.
Spearman correlation coefficient:
- It is a non-parametric test that evaluates a monotonic relationship.
- It can be used for quantitative as well as ordinal variables in a population.
- It does not rely on normality of the data.
We use the Spearman correlation here since the data are not normally distributed. Even though Pearson is more widely used, the Spearman correlation is the better fit for our dataset because it does not assume normality of the variables.
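To make the contrast concrete, here is a minimal sketch on simulated (hypothetical) data showing why Spearman is preferable when a relationship is monotonic but not linear; the variables `x` and `y` are illustrative, not from the university dataset:

```r
# Monotonic but strongly non-linear relationship (simulated data)
set.seed(42)
x <- runif(100, 1, 10)
y <- exp(x) + rnorm(100, sd = 5)  # y increases with x, but not linearly

# Pearson understates the association because it assumes linearity;
# Spearman ranks the data first, so it captures the monotonic trend.
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
```

On data like this, the Spearman coefficient is close to 1 while the Pearson coefficient is noticeably lower, which mirrors the reasoning above for choosing Spearman on non-normal data.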
Combining features or calculating new variables gives additional data points for prediction and can make the predictors stronger. Here we calculate n_men from total_enrollment and n_women.
#Create new attributes by combining different features with each other.
data_univ <- data_univ %>%
mutate(n_men = total_enrollment - n_women)%>%
mutate(n_native_fee = n_native * in_state_tuition) %>%
mutate(n_nonresident_fee = n_nonresident * out_of_state_tuition)
The number of men is calculated from the available data. Other possible new features are n_native_fee and n_nonresident_fee, which might be informative about the university's fee revenue.
cor_data <- data_univ %>%
select(c(3,5,10,12,14,15,28)) #[c(3,5,10,12,14,15,28,23)]
correlation = cor(cor_data,data_univ$out_of_state_tuition, method = "spearman")
names = rownames(correlation)
abs_cor = abs(correlation)
data_cor = data.frame(X_var = names,abs_cor = abs_cor,cor = correlation)
data_cor[order(data_cor$abs_cor,decreasing = TRUE),]
The correlation of the target feature with n_women and n_men shows no significant difference between the two; neither exhibits a strong correlation with the target variable.
Note: The tuition_income and universities data were already joined in problem 1, before data processing, analysis, and EDA.
#correlation Table - Descending
tution_cor_data <- data_univ %>%
select(c(3,19:22,24:28))
correlation_tut = cor(tution_cor_data,data_univ$out_of_state_tuition, method = "spearman")
names = rownames(correlation_tut)
abs_cor_tut = abs(correlation_tut)
data_cor_tut = data.frame(X_var = names,abs_cor = abs_cor_tut,cor = correlation_tut)
data_cor_tut[order(data_cor_tut$abs_cor,decreasing = TRUE),]
It can be observed that the newly added feature net_cost has a strong positive correlation with the target variable.
Ref: https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP_per_capita
Top 10 states by 2019 per-capita GDP, with one outlier removed. The outlier is the District of Columbia: the capital of the United States of America, not part of any state, with an extremely high per-capita GDP of over $200K and a population of just 0.69 million.
data_univ_gdp <- data_univ %>%
mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055,
state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468,
state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451,
state == "Maryland" ~ 71838, TRUE ~ 0))
plot_ly(data_univ_gdpmap, z=data_univ_gdpmap$state_gdp2019, locations=data_univ_gdpmap$state_code, text=paste0(data_univ_gdpmap$state_code, '<br>Out_of_state_total $: ', data_univ_gdpmap$out_of_state_total),
type="choropleth", locationmode="USA-states", colors = 'pink', filename="stackoverflow/simple-choropleth") %>%
layout(geo = list(scope="usa")) %>%
layout(title = 'Top 10 U.S. States by GDP per capita with Out of state total cost', plot_bgcolor = "#e5ecf6")
All 10 states have well-reputed and expensive universities, which relates to each state's economy and GDP. GDP may therefore be correlated with the target variable out_of_state_total.
GDP measures a state's economy and generally moves together with education costs and the cost of living, so state GDP should be a useful predictor of out_of_state_total. state_gdp2019 is added as a new feature to the training and testing datasets in subsequent steps for analysis, correlation, and modelling.
#Finding correlation of the GDP values with the target variable to decide whether to add GDP to the dataset as a new feature.
tution_cor_data <- data_univ_gdp %>%
select(c(32))
correlation_tut = cor(tution_cor_data,data_univ_gdp$out_of_state_tuition, method = "spearman")
names = rownames(correlation_tut)
abs_cor_tut = abs(correlation_tut)
data_cor_tut = data.frame(X_var = names,abs_cor = abs_cor_tut,cor = correlation_tut)
data_cor_tut[order(data_cor_tut$abs_cor,decreasing = TRUE),]
The variables state_gdp2019 and out_of_state_tuition exhibit a weak positive correlation of 0.28, which may still add some predictive value.
#Adding GDP to the train dataset
train_data <- train_data %>%
mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055,
state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468,
state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451,
state == "Maryland" ~ 71838, TRUE ~ 0))
GDP is one predominant factor that draws students to enrol in and receive a high standard of education from top-ranked universities, which produce highly qualified scholars. It is also positively related to stem_percent.
Note: Missing values were already handled in problem 3 (EDA and Data preprocessing), since it is a good approach to clean the data before performing EDA and visualization. This will help for better interpretation and getting more insights from the data.
Missing values:
Missing values are observations with no recorded data. Processing the data without handling them can degrade the performance of statistical tests and machine learning models, and can introduce bias leading to inaccurate results. The following variables had NAs, which were treated using mean imputation or by filling with zero:
room_and_board (Stratified Mean imputation)
early_career_pay (Replaced NA with 0)
mid_career_pay (Replaced NA with 0)
make_world_better_percent (Replaced NA with 0)
stem_percent (Replaced NA with 0)
Missing values are of the following types:
1. Missing completely at random (MCAR): The missingness is unrelated to both the observed and the unobserved values.
2. Missing at random (MAR): The missingness is related to the observed data, but not to the missing values themselves.
3. Missing not at random (MNAR): The missingness is related to the unobserved (missing) values.
4. Structurally missing data: There is a logical reason behind the non-existence of the data.
There are several approaches to handle missing data:
1. Deletion methods that remove incomplete observations.
2. Data imputation techniques.
3. Model-based methods, such as regression imputation, which predict the missing values.
Treating missing values for room_and_board: Analysis showed these values to be structurally missing (there is a pattern in which values are absent), so they can be filled with a stratified mean.
Treating missing values for early_career_pay, mid_career_pay, make_world_better_percent and stem_percent: These also fall under structurally missing data. Judging from the university names, some appear to be arts and management institutions with no engineering or technology programmes, which would contribute zero to stem_percent; this is plausibly also why the related pay variables are missing. More than 80% of the data are missing for these variables, so dropping them would be one reasonable approach; however, they can be strong predictors, so the better approach here is to fill them with zero.
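The two imputation strategies described above can be sketched on a small hypothetical data frame; the column names mirror the real dataset, but this is an illustrative toy example, not the project pipeline:

```r
library(dplyr)

# Toy data: one NA in room_and_board, two structural NAs in stem_percent
toy <- data.frame(
  state_code     = c("NY", "NY", "CA", "CA"),
  room_and_board = c(12000, NA, 9000, 11000),
  stem_percent   = c(40, NA, 25, NA)
)

# Stratified mean imputation: fill room_and_board with the mean of its
# state_code group, so the fill reflects local price levels
toy <- toy %>%
  group_by(state_code) %>%
  mutate(room_and_board = ifelse(is.na(room_and_board),
                                 mean(room_and_board, na.rm = TRUE),
                                 room_and_board)) %>%
  ungroup()

# Structural zero: stem_percent is missing where a university offers no
# STEM programmes, so 0 is the meaningful fill value
toy$stem_percent[is.na(toy$stem_percent)] <- 0
```

This mirrors the cleaning applied to the test set later in the report, where `group_by(state_code)` provides the strata for room_and_board.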
One-hot encoding is an approach to handling categorical variables: it creates a binary indicator column for each possible category of a variable.
#define one-hot encoding function
#Selecting only required variables for Modelling ML:
train_data <- train_data %>%
select(-c (X,name,state, state_code,in_state_tuition,in_state_total,out_of_state_tuition, net_cost))
colnames(train_data)
## [1] "total_enrollment" "n_native"
## [3] "n_asian" "n_black"
## [5] "n_hispanic" "n_pacific"
## [7] "n_nonresident" "n_total_minority"
## [9] "n_multiracial" "n_unknown"
## [11] "n_white" "n_women"
## [13] "type" "degree_length"
## [15] "room_and_board" "out_of_state_total"
## [17] "early_career_pay" "mid_career_pay"
## [19] "make_world_better_percent" "stem_percent"
## [21] "state_gdp2019"
#Scaling
train_data[c(1:12,15,17:21)] <- scale(train_data[c(1:12,15,17:21)])
#One-hot encoding
encoding_train <- dummyVars(~ ., data = train_data, fullRank = TRUE)
train_data <- data.frame(predict(encoding_train, newdata = train_data))
Feature scaling is a method to normalise the range of the data, and it plays a vital role in feature engineering.
There are several feature scaling methods, including:
1. Normalization
2. Standardization
3. Absolute Maximum Scaling
4. Min-Max Scaling
5. Robust Scaling
The summary statistics show a wide difference in the ranges of the variables, so scaling is needed to bring them onto a comparable scale. Here the base R scale() function is applied, which standardises each numeric column to mean 0 and standard deviation 1 (z-score standardisation). This puts all variables on the same scale and reduces the influence of extreme values.
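For reference, a minimal sketch of what base R's scale() (used in the preprocessing step above) actually computes, with a min-max rescale shown for comparison; the vector `x` is hypothetical:

```r
x <- c(10, 20, 30, 40)

# scale() performs z-score standardisation: centre each column to mean 0,
# then divide by its (n-1) standard deviation
scaled_builtin <- as.numeric(scale(x))
scaled_manual  <- (x - mean(x)) / sd(x)
all.equal(scaled_builtin, scaled_manual)  # the two agree

# Min-max scaling maps the values onto [0, 1] instead
minmax <- (x - min(x)) / (max(x) - min(x))
```

The choice matters mainly for distance-based models such as SVM and KNN, where unscaled wide-range variables (e.g. total_enrollment) would otherwise dominate.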
set.seed(4543)
data_imp_rf <- randomForest(out_of_state_total ~ ., data=train_data, ntree=1000, keep.forest=FALSE,
importance=TRUE)
varImpPlot(data_imp_rf)
In the plot above, %IncMSE is the stronger and more informative measure: it reports the increase in prediction MSE when a predictor's values are randomly permuted (shuffled). The higher the number, the more important the variable. room_and_board, degree_length.4.Year and type.Private are the three factors with the highest importance, and the IncNodePurity ranking broadly agrees with %IncMSE.
Here we apply the same data cleaning steps that were used for the train dataset.
# Data cleaning for test dataset
#Step1:
#Adding GDP to the test dataset
test_data <- test_data %>%
mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055,
state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468,
state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451,
state == "Maryland" ~ 71838, TRUE ~ 0))
#Step2: Handling missing values:
apply(is.na(test_data),2, sum)
## X name total_enrollment
## 0 0 0
## state n_native n_asian
## 0 0 0
## n_black n_hispanic n_pacific
## 0 0 0
## n_nonresident n_total_minority n_multiracial
## 0 0 0
## n_unknown n_white n_women
## 0 0 0
## state_code type degree_length
## 0 0 0
## room_and_board in_state_tuition in_state_total
## 65 0 0
## out_of_state_tuition out_of_state_total early_career_pay
## 0 0 130
## mid_career_pay make_world_better_percent stem_percent
## 130 130 130
## net_cost state_gdp2019
## 19 0
#Filling missing value for room_and_board with stratified mean imputer
test_data <- test_data %>% group_by(state_code) %>%
mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
group_by() #removing groupby
#Filling missing value for net_cost with Mean
test_data <- test_data %>% mutate_at(28, mean_imputer) #for net_cost
#Filling missing value with zero for early_career_pay, mid_career_pay, make_world_better_percent, stem_percent
test_data[is.na(test_data)] <- 0 #early_career_pay, mid_career_pay, make_world_better_percent, stem_percent
apply(is.na(test_data),2, sum)
## X name total_enrollment
## 0 0 0
## state n_native n_asian
## 0 0 0
## n_black n_hispanic n_pacific
## 0 0 0
## n_nonresident n_total_minority n_multiracial
## 0 0 0
## n_unknown n_white n_women
## 0 0 0
## state_code type degree_length
## 0 0 0
## room_and_board in_state_tuition in_state_total
## 0 0 0
## out_of_state_tuition out_of_state_total early_career_pay
## 0 0 0
## mid_career_pay make_world_better_percent stem_percent
## 0 0 0
## net_cost state_gdp2019
## 0 0
#Step3: Selecting only required variables for Modelling ML:
test_data<- test_data%>%
select(-c (X,name,state, state_code,in_state_tuition,in_state_total,out_of_state_tuition, net_cost))
colnames(train_data)
## [1] "total_enrollment" "n_native"
## [3] "n_asian" "n_black"
## [5] "n_hispanic" "n_pacific"
## [7] "n_nonresident" "n_total_minority"
## [9] "n_multiracial" "n_unknown"
## [11] "n_white" "n_women"
## [13] "type.Private" "type.Public"
## [15] "degree_length.4.Year" "room_and_board"
## [17] "out_of_state_total" "early_career_pay"
## [19] "mid_career_pay" "make_world_better_percent"
## [21] "stem_percent" "state_gdp2019"
colnames(test_data)
## [1] "total_enrollment" "n_native"
## [3] "n_asian" "n_black"
## [5] "n_hispanic" "n_pacific"
## [7] "n_nonresident" "n_total_minority"
## [9] "n_multiracial" "n_unknown"
## [11] "n_white" "n_women"
## [13] "type" "degree_length"
## [15] "room_and_board" "out_of_state_total"
## [17] "early_career_pay" "mid_career_pay"
## [19] "make_world_better_percent" "stem_percent"
## [21] "state_gdp2019"
#Step4: Checking Near Zero Variance for the test dataset
test_data_nzv <- nearZeroVar(test_data)
print(test_data_nzv)
## integer(0)
#Step5: Scaling the test data
test_data[c(1:12,15,17:21)] <- scale(test_data[c(1:12,15,17:21)])
#Step6: Perform one hot encoding
encoding_test <- dummyVars(~ ., data = test_data , fullRank = TRUE)
test_data <- data.frame(predict(encoding_test , newdata = test_data ))
#Linear model:
set.seed(123)
fitControl <- trainControl(method = "none")
lm_reg_model <- train(out_of_state_total ~.,
data = train_data,
method = "lm",
trControl = fitControl)
pred_lm_reg_model = predict(lm_reg_model, test_data)
## Warning in stats::predict.lm(object, ...): prediction from a rank-deficient fit
## may be misleading
summary(lm_reg_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34566 -3922 -464 5173 24586
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15560.7 2945.6 5.283 2.16e-07 ***
## total_enrollment 4438.6 4357.9 1.019 0.3091
## n_native 1323.6 669.1 1.978 0.0486 *
## n_asian -4466.0 4845.5 -0.922 0.3573
## n_black -4713.6 4321.7 -1.091 0.2761
## n_hispanic -8950.8 9982.8 -0.897 0.3705
## n_pacific -1800.0 909.7 -1.979 0.0486 *
## n_nonresident 1321.0 991.3 1.333 0.1835
## n_total_minority 11070.1 16353.2 0.677 0.4989
## n_multiracial NA NA NA NA
## n_unknown -571.2 788.4 -0.725 0.4692
## n_white NA NA NA NA
## n_women -1438.1 3769.7 -0.381 0.7031
## type.Private 13142.8 2771.7 4.742 3.01e-06 ***
## type.Public -315.5 2988.2 -0.106 0.9160
## degree_length.4.Year 14518.9 1812.3 8.011 1.45e-14 ***
## room_and_board 7067.7 564.9 12.511 < 2e-16 ***
## early_career_pay -14868.5 14639.0 -1.016 0.3104
## mid_career_pay 20610.4 14435.5 1.428 0.1542
## make_world_better_percent -1561.5 1698.4 -0.919 0.3585
## stem_percent -56.7 880.4 -0.064 0.9487
## state_gdp2019 921.5 573.3 1.607 0.1088
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9749 on 375 degrees of freedom
## Multiple R-squared: 0.8037, Adjusted R-squared: 0.7938
## F-statistic: 80.83 on 19 and 375 DF, p-value: < 2.2e-16
#support vector regression:
set.seed(123)
svm_reg_model = svm(out_of_state_total ~ ., data = train_data)
print(svm_reg_model)
##
## Call:
## svm(formula = out_of_state_total ~ ., data = train_data)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.04761905
## epsilon: 0.1
##
##
## Number of Support Vectors: 260
pred_svm_reg_model = predict(svm_reg_model, test_data)
summary(pred_svm_reg_model)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5686 12923 28139 30687 48779 69167
#Regression tree (CART via caret's "rpart" method; labelled Random Forest in the comparison table):
set.seed(123)
randomcontrol_reg <- trainControl(
method = "repeatedcv",
number = 4,
repeats = 5,
search = "random")
set.seed(123)
rf_reg_model <- train(out_of_state_total ~ ., data = train_data,
method = "rpart",
trControl = randomcontrol_reg,
tuneLength = 4)
rf_reg_model
## CART
##
## 395 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 5 times)
## Summary of sample sizes: 297, 296, 295, 297, 296, 297, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 8.346853e-06 9568.980 0.8063095 6156.471
## 1.491750e-04 9566.808 0.8063740 6136.795
## 1.827085e-03 9575.562 0.8056733 6165.172
## 1.049060e-02 9868.411 0.7924379 6781.708
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.000149175.
pred_rf_reg_model <- predict(rf_reg_model, test_data)
#Creating matrix
table<- matrix(0,3,1) # entry, rows, columns
colnames(table) <- c("RMSE")
row.names(table) <- c("Linear_Regression","SVM","Random Forest")
## Computing RMSE
rmse = function(observed, predicted) {
sqrt(mean((observed- predicted)**2))
}
rmse_lm_reg <- rmse(test_data$out_of_state_total, pred_lm_reg_model)
rmse_svm_reg <- rmse(test_data$out_of_state_total, pred_svm_reg_model)
rmse_ranfor_reg <- rmse(test_data$out_of_state_total, pred_rf_reg_model)
table[1,1] <- rmse_lm_reg
table[2,1] <- rmse_svm_reg
table[3,1] <- rmse_ranfor_reg
table <- as.data.frame(table)
print(table)
## RMSE
## Linear_Regression 10164.355
## SVM 8966.781
## Random Forest 9981.496
RMSE (Root Mean Square Error) measures the residual errors: it is the standard deviation of the residuals, and it shows how concentrated the data are around the line of best fit in a linear model. Here the Support Vector Machine has the lowest RMSE, meaning it makes better predictions than the other models.
One robust approach to optimising a model is cross-validation, an effective resampling technique that helps avoid over-fitting and gives a more reliable estimate of model accuracy.
The potential problems could be:
1. Pandemics or natural disasters.
2. Government regulations, such as visa rules and approvals.
AIC (Akaike's Information Criterion) and BIC (Bayesian Information Criterion) are effective metrics for comparing regression models.
AIC: It adds a penalty for each additional feature included in the model; a feature that does not reduce the error enough to offset its penalty increases the AIC. Therefore, the lower the AIC, the better the model.
BIC: Similar to AIC, but with a stronger penalty for additional features.
AIC and BIC allow candidate models to be compared on a common footing.
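A minimal sketch of extracting AIC and BIC for fitted linear models in R, using the built-in mtcars data as a stand-in (for this project, the lm object underlying the caret model could be passed instead):

```r
# Two nested candidate models on built-in data
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Lower values indicate a better trade-off between fit and complexity
AIC(fit_small); AIC(fit_large)
# BIC penalises extra terms more heavily (penalty k*log(n) vs 2k)
BIC(fit_small); BIC(fit_large)
```

Because log(n) exceeds 2 for any dataset with more than 7 observations, BIC favours smaller models more strongly than AIC does.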
Some of the real life issues could be
1. Covid like pandemic impacting the entire world
2. Government policy changes on visa related restrictions.
3. Economic downturns in developing countries, which might impact affordability for international students.
4. Policy changes elsewhere, such as free education or generous scholarships in other developed countries, which could attract students towards new opportunities.
#Categorization:
train_data_classifier <- mutate (train_data
, out_of_state_total = case_when(train_data$out_of_state_total >= 0 & train_data$out_of_state_total <= 15000 ~ 1
, train_data$out_of_state_total > 15000 & train_data$out_of_state_total <= 30000 ~ 2
, train_data$out_of_state_total > 30000 & train_data$out_of_state_total <= 45000 ~ 3
, train_data$out_of_state_total > 45000 & train_data$out_of_state_total <= 60000 ~ 4
, train_data$out_of_state_total > 60000 ~ 5 ))
test_data_classifier <- mutate (test_data
, out_of_state_total = case_when(test_data$out_of_state_total >= 0 & test_data$out_of_state_total <= 15000 ~ 1
, test_data$out_of_state_total > 15000 & test_data$out_of_state_total <= 30000 ~ 2
, test_data$out_of_state_total > 30000 & test_data$out_of_state_total <= 45000 ~ 3
, test_data$out_of_state_total > 45000 & test_data$out_of_state_total <= 60000 ~ 4
, test_data$out_of_state_total > 60000 ~ 5 ))
# Encoding the target feature as factor
train_data_classifier$out_of_state_total = factor(train_data_classifier$out_of_state_total, levels = c(1,2,3,4,5))
# Encoding the target feature as factor
test_data_classifier$out_of_state_total = factor(test_data_classifier$out_of_state_total, levels = c(1,2,3,4,5))
#SVM:
# Fitting SVM to the Training set
set.seed(123)
svm_class_model = svm(formula = out_of_state_total ~ .,
data = train_data_classifier,
type = 'C-classification',
kernel = 'linear')
# Predicting the Test set results
pred_svm_class_model = predict(svm_class_model, newdata = test_data_classifier[-17])
# Making the Confusion Matrix
pred_svm_class_model_conf_matrix = table(test_data_classifier[, 17], pred_svm_class_model)
#Accuracy
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy_svm <- accuracy(pred_svm_class_model_conf_matrix)
#KNN:
set.seed(123)
knn_class_model <- knn(train = train_data_classifier,
test = test_data_classifier,
cl = train_data_classifier$out_of_state_total,
k = 3)
# Confusion Matrix
knn_class_model_conf_matrix <- table(test_data_classifier$out_of_state_total, knn_class_model)
#Model Evaluation - Choosing K
misClassError <- mean(knn_class_model != test_data_classifier$out_of_state_total)
#Random Forest:
# Create a Random Forest model with default parameters
set.seed(123)
rf_class_model <- randomForest(out_of_state_total~., data=train_data_classifier, proximity=TRUE)
pred_rf_class_model <- predict(rf_class_model, test_data_classifier)
rf_class_model_conf_matrix <- confusionMatrix(pred_rf_class_model, test_data_classifier$out_of_state_total)
plot(rf_class_model)
#2. Create a confusion matrix for each model. Interpret the results and compare them with each other.
#Confusion Matrix for SVM:
pred_svm_class_model_conf_matrix
## pred_svm_class_model
## 1 2 3 4 5
## 1 58 4 3 1 0
## 2 11 12 7 0 0
## 3 2 2 13 7 0
## 4 0 0 4 14 0
## 5 0 0 0 12 18
#Accuracy Score:
accuracy_svm
## [1] 68.45238
SVM achieves an accuracy of about 68%, well below KNN and comparable to Random Forest.
#Confusion Matrix for KNN:
knn_class_model_conf_matrix
## knn_class_model
## 1 2 3 4 5
## 1 65 0 1 0 0
## 2 9 19 2 0 0
## 3 1 0 22 1 0
## 4 0 0 0 18 0
## 5 0 0 0 2 28
#Accuracy score:
print(paste('Accuracy =', 1-misClassError))
## [1] "Accuracy = 0.904761904761905"
KNN achieves an accuracy of about 90%, the highest of the three models.
#Confusion Matrix for Random Forest:
rf_class_model_conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 57 18 3 0 0
## 2 6 9 2 1 0
## 3 2 3 12 1 0
## 4 1 0 5 11 5
## 5 0 0 2 5 25
##
## Overall Statistics
##
## Accuracy : 0.6786
## 95% CI : (0.6023, 0.7484)
## No Information Rate : 0.3929
## P-Value [Acc > NIR] : 6.75e-14
##
## Kappa : 0.5628
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.8636 0.30000 0.50000 0.61111 0.8333
## Specificity 0.7941 0.93478 0.95833 0.92667 0.9493
## Pos Pred Value 0.7308 0.50000 0.66667 0.50000 0.7812
## Neg Pred Value 0.9000 0.86000 0.92000 0.95205 0.9632
## Prevalence 0.3929 0.17857 0.14286 0.10714 0.1786
## Detection Rate 0.3393 0.05357 0.07143 0.06548 0.1488
## Detection Prevalence 0.4643 0.10714 0.10714 0.13095 0.1905
## Balanced Accuracy 0.8289 0.61739 0.72917 0.76889 0.8913
Random Forest achieves an accuracy of about 68%, comparable to SVM and well below KNN.
KNN gives the best accuracy at roughly 90%, based on the accuracy score reported above. Note, however, that the full classifier data frames (including the target column) were passed to knn() as the train and test sets, which may inflate this score; there is also scope to improve it by tuning the value of k.
The evaluation methods used here for the classifiers are the confusion matrix and the accuracy score. Confusion matrix: a table that describes model performance, also known as an error matrix. Accuracy score: the ratio of correct predictions to total predictions made; the higher the accuracy, the more effective and robust the model.